Replication for Efficiency and Fault Tolerance in a DSM System
Abstract
Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechanism. The design of our RDSM focuses on exploiting replication of data for both fault tolerance and efficiency. This RDSM has been implemented on a NOW, and a performance evaluation shows the benefits of exploiting both types of replication to design an efficient, scalable and low-cost recoverable DSM.
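As an illustration only, the idea of backward error recovery over replicated data can be sketched in a few lines of Python. This is not the paper's protocol: the `Node` and `RecoverableDSM` classes, the ring placement of backup copies, and the single-owner page model are all assumptions made for the sketch. A real system would also need an atomic (for example two-phase) checkpoint so that the old recovery point survives until the new one is complete.

```python
import copy


class Node:
    """Hypothetical workstation: current pages plus recovery copies."""

    def __init__(self, name):
        self.name = name
        self.pages = {}      # pages this node currently owns
        self.recovery = {}   # copies saved at the last recovery point
        self.alive = True

    def fail(self):
        # A failed node loses everything it held in memory.
        self.alive = False
        self.pages = {}
        self.recovery = {}


class RecoverableDSM:
    """Toy recoverable DSM: every checkpointed page is kept on two nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _live(self):
        return [n for n in self.nodes if n.alive]

    def write(self, node, page_id, value):
        # Single-owner model (an assumption): the writer becomes the owner.
        for n in self.nodes:
            n.pages.pop(page_id, None)
        node.pages[page_id] = value

    def checkpoint(self):
        live = self._live()
        if len(live) < 2:
            raise RuntimeError("two live nodes are needed for two recovery copies")
        for n in live:
            n.recovery = {}
        for i, owner in enumerate(live):
            backup = live[(i + 1) % len(live)]  # ring placement (an assumption)
            for page_id, value in owner.pages.items():
                owner.recovery[page_id] = copy.deepcopy(value)
                backup.recovery[page_id] = copy.deepcopy(value)

    def rollback(self):
        # Discard all current pages and restore the last recovery point
        # from whichever surviving node still holds a recovery copy.
        live = self._live()
        for n in live:
            n.pages = {}
        restored = set()
        for n in live:
            for page_id, value in n.recovery.items():
                if page_id not in restored:
                    n.pages[page_id] = copy.deepcopy(value)
                    restored.add(page_id)

    def memory(self):
        mem = {}
        for n in self._live():
            mem.update(n.pages)
        return mem


a, b, c = Node("a"), Node("b"), Node("c")
dsm = RecoverableDSM([a, b, c])
dsm.write(a, 0, "x0")
dsm.write(b, 1, "y0")
dsm.checkpoint()
dsm.write(a, 0, "x1")  # modification made after the recovery point
b.fail()               # single node failure
dsm.rollback()
result = dsm.memory()
```

In this sketch the write made after the checkpoint is undone, and page 1 survives the failure of its owner because a second recovery copy was placed on another node. After the rollback, pages that lost a copy are held only once until the next checkpoint re-replicates them.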
Similar resources
ICARE: Combining Efficiency and High-availability in a DSM System
In light of the increasing throughput of local area networks, Networks Of Workstations (NOW) which provide a distributed shared memory (DSM) have become a convenient alternative to parallel architectures in the framework of parallel scientific applications. ICARE is a recoverable DSM based on backward error recovery which is implemented on top of an experimental ATM platform running the CHORUS m...
Improving the Efficiency of Replication for Highly Reliable Systems
Fault Tolerance must be provided to increase system reliability. Combining efficiency with fault tolerance is a difficult task. Fault Tolerance requires the use of redundancy while efficiency requires the elimination of redundancy. Several fault tolerance techniques have been proposed in the literature to manage the redundancy existing in the system in order to provide fault tolerance. These te...
Fault tolerance and configurability in DSM coherence protocols
With the advent of large networks and the demand to have uninterrupted service, computer systems need to be more robust and fault tolerant. There are numerous ways to implement fault tolerance and recovery. A central concept in all these methods is the requirement for replicated data for high data availability. We believe that a protocol must not only provide replication, but do so at low opera...
Improving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network
The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency, high-bandwidth interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over one hundred nodes. Each node has a dedicated output channel and an array of receivers, with one receiver dedicated to every other node’s output channel. The SOME-...